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Abstract 

We present a novel right-to-left long division algorithm based on the Montgomery modular multiply, 
consisting of separate highly efficient loops with simple carry structure for computing first the remainder 
[x mod q) and then the quotient [x/q\ . These loops are ideally suited for the case where x occupies many 
more machine words than the divide modulus q, and are strictly linear time in the "bitsize ratio" lg x/ lg q. 
For the paradigmatic performance test of multiword dividend and single 64-bit- word divisor, exploitation 
of the inherent data-parallelism of the algorithm effectively mitigates the long latency of hardware integer 
MUL operations, as a result of which we are able to achieve respective costs for remainder-only and full- 
DIV (remainder and quotient) of 6 and 12.5 cycles per dividend word on the Intel Core 2 implementation 
of the x86_64 architecture, in single-threaded execution mode. We further describe a simple "bit-doubling 
modular inversion" scheme, which allows the entire iterative computation of the mod-inverse required 
by the Montgomery multiply at arbitrarily large precision to be performed with cost less than that 
of a single Newtonian iteration performed at the full precision of the final result. We also show how 
the Montgomery-multiply-based powering can be efficiently used in Mersenne and Fermat-number trial 
factorization via direct computation of a modular inverse power of 2, without any need for explicit 
' radix-mod scalings. 

O 



> 
oo 



- 1—1 

X 



c3 



1 Introduction 



In an ingenious 1985 work [10] . Montgomery presented an algorithm for division- free modular multiplication 
( "modmul" ) which is ideally suited for modern binary compute hardware. For general background we refer 
£S) ■ the reader to Crandall & Pomerance [5] , who note that the method is a generalization of an old method of 

Hensel [1] for computing the inverses of 2-adic numbers, and who further give a nice survey of subsequent 
developments and algorithmic refinements, most of which are also detailed in the more-encyclopedic reference 
CO ! by Menezes et al. [8] . C&P make specific note of the "tough case" when the number x is much larger than 

the modulus q in the log-ratio sense, that is if n is the smallest integer such that x < q n , then we have n 1. 
They remark on several highly-tuned computer algorithms for this case, all based on variants of the Barrett 
mod technique [1 . In the present work we show how the Montgomery modmul can be successfully brought 
to bear on both the general but especially the latter (n large) case. The resulting division method consists 
of the following three steps, where R is the binary arithmetic radix used for the core Montgomery modmul 
jj] \ operations: 



1. (Algorithms A and B) Computes a scaled remainder s, related to the true remainder r = x (mod q) 
by r = — s ■ R n (mod q) for Algorithm A, and r = +s ■ for Algorithm B; 

2. (Algorithm C) Efficiently computes the needed modular power of the radix R via a divide-and-conquer 
scheme, and returns the true remainder r; 

3. (Algorithm D) Takes x and r and produces the quotient y = \x/q\. 

Steps 1 and 3 are performed using simple loops which run through the elements of the (presumably multiword) 
vector x in ascending order, and possess quite-similar loop-body arithmetic structure. Step 1 is all that is 
needed if a simple binary "does q divide x?" result is required. Step 2 uses similar core modmul arithmetic 
to compute the scaling factor needed to transform the Step 1 result out of "Montgomery space" , yielding the 
remainder (x mod q), which is used (in addition to a;) as an input to Step 3. Steps 1 and 3 each require 0(n) 
size-g modmul operations (one per loop execution), whereas Step 2 needs just 0(lgN) modmuls. Thus, 
the scheme possesses the property that the work required to compute just the remainder (x mod q) is 
asymptotically half that needed to obtain both the remainder and quotient. 
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Let us first introduce some basic arithmetic and computational notation related to the various core 
multiply operations. Given nonnegative integers x and y whose product is desired with respect to some 
positive integer modulus q (such that < x, y < q), the Montgomery method begins with an arithmetic 
radix (or "base") R, typically chosen for computational convenience. R and q must be coprime, thus the 
modular inverse qinv of q exists, such that 



The Montgomery multiply does not directly return x ■ y (mod q) ; rather it returns a "shifted mod" involving 
the inverse of the radix R, thus we include R in the argument list as a parameter: 



which needs suitable adjustment if the "true mod" x ■ y (mod q) is required. Often it is not - for example 
in the fast divisibility algorithm we shall give in the next section, we only care if the output is zero or not, 
obviating any need for postprocessing arithmetic - but when it is, the extra effort needed to scale the output 
in order to produce the true mod may be appreciable. The need to perform this scaling-away of the inverse 
power of the arithmetic radix is the major reason why the Montgomery multiply has not driven other modmul 
algorithms from the field in more or less any context requiring a modmul, although it has become ubiquitous 
in areas such as Difhc-Hellman and RSA-style public-key cryptography and elliptic curve arithmetic, all of 
which feature repeated multiplications with respect to a fixed modulus and thus in which radix-scalings are 
few relative to overall modmul count. In this respect long division x/q, where we especially mean "long" 
in the literal sense that \gx/lgq 3> 1, appears to be a bit of an anomaly, because although it does feature 
repeated multiply modulo q, the best-known high-performance implementations such as that of the Gnu MP 
library [3] eschew the Montgomery method, at least in any fashion resembling the one described here. 

Here we take R = 2 h , a power of 2 typically equal to the an unsigned-integer wordsize W supported by 
the compute hardware and programming language being used, or some integer power of W if the modulus 
q > W, requiring a multiword- arithmetic implementation. It is further assumed that twos-complement 
integer arithmetic is being performed both with respect to R and W. The restriction of coprimality of the 
radix and modulus thus simply reduces to that of the modulus q being odd for the Montgomery modmul to 
work. The more general case of even moduli can easily be accommodated via simple pre- and postprocessing 
to take care of any powers of 2 appearing in q, which we shall also detail. (We specifically refer the reader 
to the end of the section describing Algorithm D). 

Assuming one has computed the needed mod- inverse qinv, a typical computational implementation of 
the Montgomery modmul consists of a three-multiply sequence, followed by a mod-subtraction, which can 
be expressed in pseudocode as 

Algorithm: MONT_MUL b 

Data: Unsigned b-bit integers x,y,q,qinv, q odd and q ■ qinv = 1 (mod 2 b ) 
Result: The radix R — 2 b Montgomery product x ■ y ■ (mod q) 



ninth lo, hi; 

hi:lo = UMUL_LOHI b (:r, y); 
lo = MULLb(gm«, lo); 
lo = UMULH b (<z,Zo); 
if hi < lo then 

| return hi — lo + q; 
else 

| return hi — lo; 
end 



Here uint b denotes a 6-bit unsigned integer type (hardware-supported or emulated-multiword) , and for 
the UMUL_LOHIb output pair we mimic the colon-separated register-pair notation used to describe the MUL 
instruction output in the x86_64 instruction set architecture (ISA); arithmetically we mean hi : lo = lo+R-hi. 

The three multiply variants appearing in the function are as follows: 



q ■ qinv = 1 (mod R). 



(1) 



M(x, y,q; R) — x ■ y ■ R 1 (mod q) 



(2) 
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UMUL_LOHIb(uintb x,uintb y) 



Full 26-bit double-width unsigned multiply which 
returns the lower and upper halves of x ■ y in a pair 
of variables. For this operation the result depends 
on whether the inputs are treated as signed or un- 
signed, so we add a leading 'U' to specify that we 
mean the unsigned version. 



MULLb(wnib x,uintb y) 



Lower- half multiply, that is, x ■ y (mod R). We 
label the inputs as unsigned here, but for this op- 
eration the result is the same whether the inputs 
are treated as signed or unsigned. 



UMULH b (mnift x,uintb y) 



Upper- half multiply, returns [(x-y)/R\, the upper 
6-bit half of the unsigned product x ■ y. 



These three multiply operations are related simply as 

UMUL.LOHI b (x, y)= x -y = MULL b (>, y) + R ■ UMULH b (a;, y). 

Since R is a power of 2 we can also express MULL and UMULH in terms of simple binary shifts and masks 
of a full product: 

MULL b (x, y) = UMUL.LOHI b (x,y) & (2 fc - 1); 
UMULH b (a;, y) = UMUL.LOHI b (x, y) > 6. 

The three-multiply sequence UMUL_LOHI /MULL /UMULH is followed by mod-q subtraction of a pair of 
6-bit quantities, typically effected an if/else branch (as in our pseudocode), or via a conditional- mov-style 
operation, either of the explicit kind if the hardware supports it, or of the "hand-rolled" bitwise-mask-and- 
add variety. 

2 Notes on the Newtonian Iterative mod-Inverse 

The multiplication algorithm first requires computation of qinv, the multiplicative inverse of q modulo R. 
This is a modular analog of the inverse over the reals, and in practice one even uses the same Newtonian 
iterative-inversion algorithm to compute qinv as is commonly used to effect a fast floating-point inverse 
computation needing no hardware division, the latter being generally a slow operation in both the floating- 
point and integer domains. One starts with an initial guess qinvg having one or more low-order "good bits" 
in the sense that qinvQ ■ q = 1 (mod 2 m ) with m > 1. One advantage of the Newtonian iteration in the 
modular realm is that as long as even the least-significant bit of qinv is correct, convergence is guaranteed. 
So since q odd we can take qinvg to be any odd number, e.g. 1 or even q itself. Montgomery (personal 
communication) alternatively suggests the following choice, which amounts to a kind of low-precision bitwise 
table-lookup in the form of a simple inline formula: 

qinv = XOR(3<7,2), (3) 

(where XOR denotes binary exclusive-or) which guarantees at least the low 5 bits of the resulting initial 
iterate to be correct. The ensuing Newton iteration involves repeated steps of form 

qinvj — qinv j -i * (2 — q * qinvj-i), (4) 

Where the asterisk operator is a shorthand for the MULL b operation (that is, multiplication modulo 2 b ) 
on the pairs of operands it separates, either with respect to some fixed arithmetic base 2 b > q, or to some 
variable power of 2 whose number of bits increases with each iteration to keep pace with the per-iteration 
doubling of number of converged bits of qinv, and which exceeds q only on the final iteration. The advantage 
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of taking R to be a power of the natural integer wordsize is apparent here, since e.g. if R = W we really can 
use the hardware multiply operation denoted in programming languages such as C and Fortran by '*', that is, 
integer product modulo W . For R > W we replace the literal '*' by a software-emulated MULLb operation, 
which for R not terribly larger than W (say, R < W w ) will typically be based on some hardware-optimized 
implementation of grammar-school multiply. For larger radices, optimized software implementations will 
transition to a subquadratic algorithm such as Toom-Cook and eventually for very large operands, to some 
kind of fast transform-based discrete convolution. Similar large-argument optimizations apply to the other 
forms of multiply required by the Montgomery algorithm. 

Returning to the inverse computation, the number of significant bits at the bottom doubles on each itera- 
tion, until we reach the bitwidth set by the modulus q, which determines the width of the MULLb operation 
used. Since we only care about the final converged value of qinv we can do the iteration "in-place" . A 
typical C implementation of the inversion Q for R = 2 64 might look as follows, where mni 64 is a typedef 
for the native unsigned 64-bit integer type on the platform in question: 

Algorithm: 64-bit mod-inverse computation 

Data: Unsigned 64-bit odd integer modulus q 

Result: Mod-inverse qinv, such that q ■ qinv = 1 (mod R = 2 64 ) 

uinte4 tmp, qinv = XOR(3q,2); 
for j <— 1 to 4 do 

tmp = MULLe4((j, qinv); 

qinv — MULLg4(qinw, {uintQ&)2 — tmp); 
end 



Let us crunch the numbers on an actual numerical example, the 64-bit modulus q = 16357897499336320049. 
First note that 3q = 19 (mod 16), which in binary = 100112- XOR of this with 2 toggles the set bit in 
the 2s slot to yield 100012- Since we only care about these 5 bits of the initial inverse seed value, we could 
simply initialize qinvo = 17. In fact, since in this case we are targeting a 64-bit result the high bit of this 
initial iterate gains us nothing in terms of work savings: We could equally well take qinva = OOOI2 = 1, and 
in either case would still require 4 iterations to obtain a 64-bit-converged result. However, since in prac- 
tice such bit-masking is superfluous, we illustrate the algorithm exactly as given. We thus have XOR(3q, 
2) = 12180204350589856915 (note the multiply by 3 here is also modulo 2 64 ) which in hexadecimal = 
A908C752C8936C91, where we have underlined the converged hex digits, which at this point means just the 
rightmost one (though at least the low bit of the next- higher hex digit is also converged) . We note that for 
this particular choice of q our initial-iterate formula actually yields 6 good bits in qinvo. (In general there is 
a 50% chance of this "extra bits for free" behavior occurring, according to the low bits of q). We now iterate 
4 times, obtaining the data in Table Q] again with the same underline-highlighting of converged digits. The 
right-column iterate qinvj is the input (in addition to the fixed q) to the next iteration. The iteration yields 
the desired 64-bit mod inverse, which in decimal form is qinv = 9366409592816252113. Were we to do one 
more iteration, we would observe that the product q ■ qinv would have its 64-bit lower half equal to unity, 
confirming that qinv has reached its desired fixed point. 

Table 1 - Convergence history of Newtonian iteration for 64-bit mod-inverse of q — 16357897499336320049: 
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q ■ qinv j -1 


qinvj 


bits 


1 


F3D66805FF3D2BC1 


4141B6714938A4D1 


12 


2 


B0955EB99F05F001 


656E0C4A279FB4D1 


24 


3 


0A62BFBCBF000001 


6EAB2BE6389FB4D1 


48 


4 


E97F000000000001 


81FC2BE6389FB4D1 


64 



If the hardware has fast hardware support for 32-bit integers and emulated 64-bit integer support (as is 
the case on many platforms still running 32-bit operating systems), one might prefer three iterations using 
32-bit operands, followed by a single 64-bit iteration. Similarly, if a radix larger than 64 bits is used, one can 
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proceed to 64 bits in qinv using the above loop, then do a single iteration at 128 bits using software-emulated 
128-bit integer arithmetic, and so forth. As the operands get large enough to exceed the natural hardware 
wordsize W it also pays to exploit some simple optimizations resulting from the properties of the mod-inverse. 
For example, let us say our hardware supports integers of 64 bits, in the sense that the operations MULL@4 
and UMULH64 (either directly or in emulated form via the upper half of UMUL_LOHl64) are supported 
via hardware instructions. Further assume we have a 64-bit-accurate mod-inverse in hand (that is, bits 0:63 
of the inverse), and wish to do one more Newtonian modular inversion iteration modulo 2 128 to obtain a 
128-bit mod- inverse. We can do so efficiently, by way of direct computation of bits 64:127 of the inverse, 
using a combination of three of the above 64-bit MUL primitives. More generally, given a 6-bit-converged 
partial mod-inverse, one can obtain 6 more bits of precision using just three 6-bit MUL primitives. Here the 
bit count 6 need not be equal to that of the ensuing Montgomery-mul arithmetic - all the analysis holds 
generally - but let us assume it for notational convenience. 

Define an emulated 26-bit type (e.g. in C via a struct) consisting of a pair of 6-bit "digits" do and d\, 
such that a 26-bit quantity x expands - here again using C-style notation - as x — {x.d\ ■ R + x.do), where 
R = 2 b . We then take our 6-bit-converged mod- inverse qinvb and use it to initialize the low word of such a 
26-bit variable, that is, set qinv. do = qinvb and qinv.di = 0. Since we presumably are working with a 26-bit 
modulus q, we similarly store it in another 26-bit variable q = (q.di ■ R + q.do) - the 6-bit inversion will 
have only used the lower half of this. We also have by construction that the fully (that is, 26-bit) converged 
inverse satisfies MULL2b(g, qinv) = 1 (mod R). We can take advantage of these two facts to speed the 26-bit 
inversion iteration. The product q ■ qinvb yields a 26-bit partial multiplicative identity, where by "partial" 
we mean having only its lower 6 bits unity, that is, identity modulo R: 

lb '■= q ■ qinvb = {q.d\ ■ R + q.do) ■ qinv. do — q.d\ ■ qinv. do ■ R + q.do ■ qinv. do = 1 (mod R), 

where here the '•' are shorthand for full- width products of the operands in question. In terms of 6-bit 
hardware operations, since our input value qinvb is 6-bits-good, we know a priori that the lower half (bits 
: 6—1, word do) of the 26-bit product is equal to 1, and need not explicitly compute this half. Extracting 
bits 6 : 26 — 1 of If, , which is the half of interest, we have: 

Ib-d\ = MULLb(<?.G?i, qinv. do) + MULHb(<7.dcb qinv. do), 

where the addition is simple 6-bit hardware integer addition modulo R. To effect the subtraction of the 
2-word result lb — (1,/fc.di) from the constant 2 we note that this subtraction leaves the low 6-bit word 
unchanged so long as 6 > 2, i.e. we have at least a 2-bit-converged mod-inverse. The high word simply 
requires arithmetic negation, thus (2 — lb) = (1, — J&.di). We use that the lower half of this is unity to 
simplify the ensuing MULL2b(<?i^, (2 — h)) operation: since qinv still just has the low word nonzero, one 
has 

qinv ■ (2 — I) = qinv. do ■ {—I.d\ ■ R + 1). 

The low half of this is again guaranteed to be unity, so starting with a 6-bit accurate approximation to the 
multiplicative mod-inverse stored in qinv.do, the entire sequence of operations needed to perform the next 
Newton iteration and obtain a 26-bit result can be condensed to the following direct computation of the 
next-higher 6 bits of the inverse, needing just three 6-bit multiply instructions, along with one addition and 
one arithmetic negation: 

qinv.di — MULLb(— qinv.do, MULLb(<?.di, qinv.do) + UMULHb(<?.do, qinv.do)). (5) 

where the negation and addition are both twos-complement, i.e. modulo R. 

In order to compute a work estimate for ([5]), we must address the question of whether the upper-half 
product here is a true half-multiply (in the sense of needing asymptotically the same computational work as 
the a MULL operations as 6 becomes large) or whether the UMULH needs to be computed via a full double- 
width UMUL_LOHI (retaining just the upper half of the result) in order to properly capture the carry out 
of the low half of the product. In general, half-multiply implementations of UMULH for multiword inputs 
proceed by truncating the multiplication rhombus and approximating the exact carry out of the the lower- 
half product using a "carry layer" of thinness similar or equal to the bitwidth of the hardware wordsize lg W. 
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This is effective if the carries occurring in the wordsize computations are local, which is a good assumption 
if the inputs are quasirandom, in which case the chance of an incorrect carry resulting from the truncated- 
rhombus approximation is in some rough sense proportional to 1/W. In the context of the Newton iteration 
for the mod-inverse, however, the inputs to the q ■ qinv product yielding our partial multiplicative identity 
are the very opposite of random in this sense, because the very purpose of the iteration is to produce a 
product whose low 6 bits have the specific pattern 000. ..001. This by definition entails "maximally nonlocal" 
carry behavior in the low half of the q ■ qinv product. But that very same property provides a way out of 
the seeming difficulty, since it guarantees that the discarded low-half product MULLb^.dcb qinv. do) = 1, 
by construction. In other words, the UMULH in the streamlined iteration ([5]) can truly be performed as 
a half-multiply, but if one does things this way one must compare the terms summed in the carry-layer 
approximation (which sum is normally simply discarded after it has been used to generate the carry) to the 
expected value 0, in order to identify any missed carry resulting from the approximation. We shall enounter 
a similar application of error-corrected upper-half multiply in our remainder and quotient algorithms. 

Thus the cost of ([5]) can be brought down to three 6-bit half-multiply instructions, which is less than 
one-half the MUL cost needed for the naive 26-bit version of the inversion iteration, for which the inner 
/ = MULL2b(<Z, qinv) operation forms a 2&-bit result using one double- width UMUL_LOHIb operation and 
two 6-bit "half-multiply" (with respect to the width of the 26-bit inverse we seek) MULLb instructions: 

I.d : di = \JM\JL.LOUl h (q.do,qinv.d ); 

I.d±+= MULLb (g.cio, qinv .di) + MULLb (g.di, qinv. do). 

This then has its high digit negated to produce (2 — /) prior to being fed to a second such multiply of 
the MULL2b(-f, qinv). Analogous considerations apply to each additional bit-doubling iteration required, so 
the work of each successive "bit-doubling" step using ([5]) is asymptotically equivalent to the summed work 
of all the preceding iterations, hence the total work needed to obtain the fully converged inverse using this 
method is indeed less than that needed by a single "naive" Newtonian iteration performed at the full final 
precision needed. 

By way of example, let us compute the inverse of the 128-bit q = 225797717267637708506527464987314161. 
Using 64-bit hardware arithmetic we have q.d = 1654746039858251761 and q.dx = 12240518780192025. We 
compute qinv with respect to the arithmetic modulus 2 128 , the smallest power of the hardware wordsize 
greater than q. Starting with the initial guess formula Q, 4 Newton iterations yield a 64-bits-good partial 
inverse qinv. do = 18061898331188349201, and it is easily verified that MULL64(g.do, qinv.do) = 1. Now 
making use of ((5J with operand bitsize 6 = 64: 

qinv.di = MULLg4(— qinv.do, MULLg4(g.c?i, qinv.do) + UMULH64(g.<io, qinv.do)) 

= MVLL 64 (- qinv. d , 12364022002462652329+ 1620223851777327935) 
= MULL 64 (384845742521202415, 13984245854239980264) = 5329826773734796952. 

Thus qinv = 5329826773734796952 -2 64 + 18061898331 188349201, and again we see that MULLi 28 ( 9 , qinv) = 
1, as desired. 

3 A Fast Multiword Divisibility Test 

In this case it is not modmul by another number less than the radix we are interested in, but rather modmul 
by the radix R itself. In a typical modular reduction of a multiword integer of form 

n-l 

•'• ^ " ■/'"• (6) 

i=o 

one might start with the most-significant digit a n _i, reduce a„_i ■ R modulo q and carry the result rightward 
into the next-lower digit, a "left-to-right" approach to the reduction if one visualizes the terms of the above 
expansion written out in decreasing significance from left to right, analogously to the way one writes multi- 
digit numbers in decimal and other common small bases. 



6 



Here we exploit the fact that the Montgomery modmul naturally introduces an inverse power of the radix 
R every time it is performed. Thus, starting with the least-significant digit ao, a Montgomery modmul of this 
with unity yields M(ao, 1) = ao • R~ x (mod R). But this is exactly the "modular reduction carry" one needs 
if proceeding in right-to-lcft fashion, that is, the carry into the next-higher term. That observation, followed 
by a bit of carry-related work, leads immediately to the fast multiword divisibility algorithm captured in 
pseudocode form as Algorithm A. 



ALGORITHM A: IS_DIV_A, fast right-to-left divisibility check 



Data: Unsigned b-bit integer array x[n], b-bit scalars q,qinv, with q odd and 

q ■ qinv = 1 (mod 2 b ) 
Result: True if q divides x, false otherwise. 

ninth tmp, bw, cy = 0; 
for i <— to n — 1 do 

tmp = Xi — cy; 
bw = (cy > Xi); 

// Add q to loop output if had a borrow: 
tmp — MULLt, (tmp, qinv) + bw; 
cy = \JM\JLU h (tmp,q); 
end 

return (cy —— 0); 



This algorithm can be trivially generalized to also permit even divisors, by first counting trailing binary 
zeros in q (call this number tz q ), and one immediately returns FALSE if x has fewer trailing zeros than q. Oth- 
erwise - in pseudocode but with all quantities arbitrarily large here - one returns IS_DiV_A(;r', q' , qinv, n'), 
where x' — (x 3> tz q ), q' = (q 3> tz q ) and qinv is the mod-i? inverse of the right-justified divisor q'. The 
number-of- words parameter can be set via n' — n— (tzb ~> b) if one wants to account for the length reduction 
due to the right-shifting of inputs by one or more entire &-bit words, without the extra expense of counting 
leading zeros of x required if one wants to achieve the maximum possible length reduction by counting 
partial-word shifts; otherwise one can simple take n' = n. 

We assume nothing about hardware support for carry flags on add and subtract here, thus the explicit 
check for borrow-on-subtract using unsigned comparison (cy > Xi). The one slight subtlety related to the case 
where the subtraction results in a borrow is that since the underlying remainder accumulation is (mod q), 
such a borrow must be accounted for by re-adding q. Since the ensuing low-half multiply of tmp is by qinv 
and q * qinv = 1 (mod R), one can simply add 1 to the MULLb(imp, qinv) result instead of adding q to the 
computed difference stored in tmp, which is advantageous if q occupies multiple machine words. 

We can ignore the possible carry out of the resulting tmp = MULLb(imp, qinv) + bw step, because we 
are computing a result (mod q): If the addition overflows, tmp ends up being R — 2 b too small as a result 
of twos-complement arithmetic. If we restore the dropped R in the next step, instead of 



i.e. the result is the same (mod g). Thus we are letting the twos-complement arithmetic help with the 
modding here. (Note that the MULLb operation is doing effectively the same thing). 

Because of the multiply-by-unity nature of the processing of each term in the expansion of x we do not 
need the double-width multiply which opens the three-multiply sequence of the general Montgomery modmul: 
Here (ignoring the carry-in from the next-lower term for the moment) we have hi : lo = UMUL_LOHI(a;i, I) = 
: Xi, and instead of explicitly negating the output of the UMULH operation to effect the final hi — UMULH0 
step, we simply subtract the UMULH result from the next-higher term at the start of the next loop execution, 
thus inlining the negation with the leftward carry propagation. The core loop in the algorithm thus constructs 
a scaled sequence of modular partial sums such that on the iih loop iteration, 



cy = UMULH b (tmp, q) 



we have 



cy = UMULH b (trap + R,q) = UMULH b (£mp, q) + q, 




(7) 
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Table 2 - Per-loop-iteration data for Algorithm A applied to x = 2 977 - 1 and q = 16357897499336320049, with 
R — 2 64 . Rightmost column: A simple x86_64 assembly-code implementation of the remaindering loop. 



i 


Result (cy) 


— cy ■ R l+1 (mod q) 







8052108280172618802 


1CFD12E467CEDBCE 


MOV 


[x] ,R10 


1 


13395404783617144454 


4D611EA3809531E7 


MOV 


[len] ,RCX 


2 


14290423936650903017 


BD293142B725F2B8 


MOV 


[q] ,RSI 


3 


12694450473754035419 


6DA6C745723D2042 


MOV 


[qinv] ,RBX 


4 


13022541340536637118 


62A5DDA094A31338 


X0R 


RDX, RDX // cy = 


5 


7849884873665013561 


32D35CC447414DD0 


X0R 


RDI ,RDI // bw = 


6 


139461722106114244 


C4E358F892AFD7CB 


loop : 


/ / Loop downward 


7 


6660703926365324543 


841646B12435044D 


MOV 


(R10) ,RAX // x[i] 


8 


13147792529020392181 


953C002AC7E9CA9B 


ADD 


8, R10 // &x[i+l] 


9 


12940374201018017432 


87A198DD565BAE6E 


SUB 


RDX , RAX // x[i]-cy 


10 


9345504322264630629 


665B982F2C57CF9F 


SETC 


DIL // bw = CF 


11 


10439060481633841924 


C2F464196442F72D 


IMUL 


RBX , RAX // tmp*qinv 


12 


11180607432989656657 


92AB735A1C3B4927 


ADD 


RDI, RAX // tmp+=bw 


13 


6407570042918850368 


C2A0BA67701C6754 


MUL 


RSI // UMULH(q,tmp) 


14 


12260751538328612790 


339D02265F21100D 


DEC RCX 




15 


5031209829575536552 


77ABEA1607BF1817 


JNZ loop 
MOV 


// loop if RCX != 
RDX, [cy] // Output 



To obtain the true remainder r := (x mod q), we must take the output value, scale it and negate it (mod q): 

r = -cy- R n (mod q). (8) 

Thus if we desire the true remainder rather than merely knowing whether it is zero on not, we just encapsulate 
the same computational loop in a slightly different interface, which here further performs the negation prior 
to returning to reinforce that it is a (mod q) negation rather than a twos-complement (mod R) one: 



ALGORITHM A': REMAINDERS, fast right-to-left scaled remainder computation 
Data: As for Algorithm A 

Result: Scaled remainder s = x ■ R~ n (mod q). 

[Same as loop body of Algorithm A] 
if cy 7^ then 

| return q — cy; 
else 

| return cy; 
end 



By way of a concrete numerical example which can be easily reproduced either via a tiny program or "by 
hand" (using e.g. the Unix 'be' utility or a freeware package such as Pari/GP), let us consider the problem 
of finding the remainder of x — 2 977 — 1 (which needs n = 16 64-bit words to store explicitly) with respect to 
the 64-bit modulus q = 16357897499336320049. The 64-bit mod inverse qinv = 9366409592816252113, and 
Table 2 lists the successive UMULH outputs at end of each loop execution in the middle column, and the 
results of applying the above scaling formula (with loop iteration i replacing n in the radix power) to these 
successive iterates in the third column from left - we write the scaled outputs in hexadecimal form to ease 
comparison with analogous data we shall provide in § 3J The reader can verify that these right-column data 
equal 2 64 — 1, 2 128 — 1, 2 192 — 1, ... (mod q), i.e. that the iterates are indeed the scaled partial sums as given 
in 0. 

The output of the loop is cy = 5031209829575536552, which is nonzero and thus indicates that q does 
not divide 2 977 - 1, and indeed the properly scaled output 8623243291871090711 = 2 977 - 1 (mod q). 

Timing tests of Algorithm A on a commodity 64-bit PC - the author used a 64-bit GCC 4.2 build 
on a vintage 2009 sub-$1000 Apple Macbook with a 2 GHz Intel Core 2 processor - yield a timing of 17 
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cycles per loop iteration (that is, per dividend word) in the case of a 64-bit modulus. This is for a generic 
C implementation, whose only hardware-specific optimization was inlining of the x86_64 MUL instruction 
(128-bit product of 64-bit unsigned integer inputs) via a tiny assembly code macro in order to provide the 
needed UMULH functionality. Obvious optimizations to consider with a view toward speeding things up 
further include explicit unrolling of the loop, a version of the function which tests several candidate moduli 
in parallel, and coding of either the entire loop body or function in assembler on one's target architecture. 
On x86_64, this can further include usage of the hardware carry flag to eliminate the unsigned comparisons 
used to check for carry-on-add and borrow-on-subtract in our pseudocode examples. However, after trying 
all of these, the only one which yielded any appreciable throughput improvement was the multiple-modulus 
approach, which does not appear to be relevant to the "bread and butter" case of dividing a large number 
x by a single-word q, so-call "n/1 division" in the parlance of [S]. Fortunately, first appearances oft deceive; 
we shall have more to say about this shortly. 

In a very real sense a single- word-g is the worst-case scenario for the algorithm, because the serialization 
imposed by the arithmetic structure imposes a severe constraint, and the loop body is so simple that there 
is little other work which can be performed while waiting for the two high-latency MUL operations to 
complete. This also explains why loop unrolling is not helpful in this case. Things should be better for 
multiword q, since the attendant multiplies can be structured to hide the latency of the underlying hardware 
MUL operations. 

The rightmost column of Table 2 contains a simple x86_64 assembly-code implementation of the remain- 
dering loop, with operand pairs in SRC, DEST order; we further left-justify loop-control-related labels and 
instructions and tab-indent the others. It is not the assembly code itself which is of interest here but rather 
what it tells us about the expected cycle count per loop execution. The first 4 lines simply load the address 
of the x-array, the array length and the integers q and qinv into hardware registers, where we give full 
play to the historical arcana contained in the x86_64 instruction set architecture; the next 2 lines zero a 
pair of carry-related registers. The key 7-instruction sequence is the one which follows the 'loop:' label 
and implements the loop body: (1) dereference array pointer to move current dividend word into register 
RIO; (2) increment array pointer by 8 bytes to point to next word of x; (3) subtract lower-word carry from 
current dividend word (this sets the internal x86 carry flag bit CF if it underflows); (4) use the SETC 
instruction to save the state of the carry flag in DIL, the low byte of RDI (i.e. RDI contains 1 if Xi — cy 
underflowed, if not); (5) low-half-multiply of tmp and qinv; (6) add the saved borrow to the MULL result; 
(7) high-half-multiply. With regard to the latter instruction, note the x86 MUL instruction assumes one 
input is in register RAX and takes the register containing the other input as its only explicit argument; it 
then overwrites the RDX:RAX register pair with the 2-word result. 

The key performance-limiting dependency chain here is that consisting of instructions SUB-IMUL-ADD- 
MUL - the carry-saving instruction, be it SETC as here or SBB as in our later quotient-loop example, can 
be run parallel to IMUL - which according to widely available Core 2 instruction-timing data have respective 
latencies which sum to 1+5+1+8 = 15 cycles, just 1-2 cycles less than observed in the timing experiments 
on our builds, which use x vectors several thousand words in length, sufficient to render any non-loop-related 
overhead negligible. 

Our GCC-built high-level C-code and assembly versions of the loop yield respective timings of 17 and 
16 cycles, closely matching the best-expected one on the Core 2 implementation of x86_64, and the analysis 
reveals that there is no possible clever restructuring of the code which can avoid the issue of instruction 
latency. This also explains why the aforementioned multiple-modulus variant of the code yielded a gain: 
processing more than one modulus in parallel allows useful work to be done in the idle cycles of the single-q 
instruction chain. The case of a fixed dividend and multiple divisors is common enough to be of some interest, 
but if possible we would like to be able to make use of this insight regarding overlapping of independent 
sequences in the single-g case as well. For x-array dimensions of sufficient size - say 10 or more elements - we 
can effectively do so by breaking the single processing loop into 2 or more disjoint segments, each of which 
computes a partial remainder for just its portion of the summation These independent remaindering 
steps can be performed in parallel and their outputs combined at the end. By "parallel" we mean not in the 
usual sense of being run on separate processors or in separate execution thread, but the interleaved execution 
is quite akin to multithreading, just in the sense that the "added threads" are really idle cycles in an existing 
single-threaded execution stream. Because this loop restructuring is quite distinct from unrolling (although 
it can be combined with the latter process, if desired), we prefer the term "loop folding", and define the 
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number of array segments which get processed in parallel as the folding factor F. We shall also use the term 
".F-way folding" to describe a particular value of F. For example, a 2-way-folded implementation might look 
like that shown in Algorithm I Ax2l If the dividend-array dimension is not an exact multiple of F, we simply 
add one or more zero-padding elements beyond x n -\ in order to make it so. 



ALGORITHM Ax2: 2-way loop-folded implementation of algorithm IS_DIV_A. 

ninth n2 = n/2, tmpO, tmpl, bwO, bwl, cyO = 0, cyl — 0; 
for i to n2 — 1 do 

tmpO — Xi — cyO; tmpl = Xi+ n 2 — cyl; 

bwO = (cyO > Xi); bwl = (cyl > x i+n2 ); 

// Add q to loop output if had a borrow: 

tmpO — MULL b (fmpO, qinv) + bwO; tmpl = MULL b (fmpl, qinv) + bwl; 
cyO = UMULH b (impO, q); cyl = UMULH b (tmpl, q); 

end 

uintb scale — [Compute scaling factor P ■ R = R n2+1 (mod q); cf. Algorithm C] 

cyl <- MONT_MUL b (q/l, scale, q, qinv); // cyl * P (mod q) 

// Sum the scaled partial remainders: 

uintb cy = (cyO + cyl); 

// Check for overflow on addition: 

if (cy < cyO) or (cy > q) then 

I cy «- (cy - q); 
end 

return (cy == 0); 



After the loop completes, we combine the F partial remainders via 

cy = cy + cy 1 -P + cy 2 -P 2 + ... + cy F - 1 -P iF - 1 '> (mod q), (9) 

where the x-array is presumed to have been zero-padded to make n a multiple of F and the modular radix 
power P = R n / F can be efficiently computed using Algorithm C, described in §0 The weighted summation 
can be efficiently performed by rearranging it using the standard nesting, whereby one computes the pairwise 
weighted-sums beginning with the innermost and working outward: 

cy = cy Q + (cyx + (cy 2 + ... + (cy F - 2 + cy F -i ■ P) ■ P) ... ■ P) (mod g). (10) 

The result is then negated (mod q) and scaled using a final modmul by P using the identity ([8]) if the 
true remainder is desired. This procedure requires computation of just a single modular power of the radix 
R; note that if one also uses a Montgomery-mul to perform the scaling multiplies as shown in our 2-way 
pseudocode example, one computes not P but rather P ■ R = R' l ^ F+1 . 

In this context, the same instruction-sparseness which causes the left-to- right remaindering algorithm 
to be latency-dominated in serial-execution mode now proves to be an advantage, since it yields much 
opportunity for idle-cycle-filling interleaving. We tried several folding factors, first using C code; again using 
the single- word-q case on the Core 2, for F = 1, 2 and 4 we obtain average cycle counts per x-word of 17, 9.5 
and 6.5 cycles, respectively. We have already established that the strictly serial loop has a minimum expected 
cycle count of 15, so the for 2-way and 4-way folding the best-case cycle counts are 15/2 = 7.5 and 15/4 = 
3.75. These - especially the last one - are almost certainly optimistic, as they disregard additional factors 
such as instruction-decode and issue-rate limits. In practice, our pure-assembler version of the respective 
loops yields a mere 1-cycle gain for F = 1 and only modest gains of one-half-cycle per input word for the 
larger values of F, thus our current best timings are 16, 9 and 6 cycles per x-word. We did not attempt 
to go beyond F — 4 since at that value the interleaved-instruction code uses up nearly all of the available 
general-purpose hardware registers of the x86_64 architecture. Larger folding factors are certainly possible 
but will require very careful analysis to reduce the need for register spills, register-freeing optimizations such 
as accumulation of multiple carry flags in a single register, and much timing-driven tuning of the code in 
order to yield further cycle-count reductions. 

For large-integer arithmetic in the sense that the modulus q is itself much larger than the natural integer 
wordsize of the underlying hardware, the cost of the algorithm is dominated by the two multiply operations, 



10 



the MULL of which is a true half-multiply in that its cost is asymptotically half that of a general double- 
width product. The UMULH operation, as noted in § [2J cannot generally be made similarly faster than a 
double-width product because we cannot be sure that the inputs are such that the carries into the high half 
of the "multiplication rhombus" can be safely approximated by truncating the rhombus. However, as we 
shall show in the following section, the discarded low half of the full double- width product trap ■ q here again 
has a simple closed-form expression which can be computed cheaply without any explicit multiplication, 
allowing the carries into the upper half to be accurately predicted and thus the UMULH result to again 
be computed at the same asymptotic cost as needed for the MULL, as was the case for the streamlined 
Newtonian mod-inversion. Thus we approximate the cost of each execution of the above loop for multiword 
moduli is asymptotically equal to that of two "genuine half-multiplies", or equivalently a single general 
double-width product, UMUL_LOHI b (a;, y). 

We shall examine how to efficiently - that is, with asymptotically less effort than needed for the above 
loop, as n becomes large - perform this scaling and final modular reduction step in § El after first considering 
a variant of the above remaindering algorithm, which will prove useful in terms of optimizing both the 
scaled-remainder computations, as well as the scaling step needed if the true remainder is desired. 

4 Fast Divisibility Check, Version 2 

This variant results from the seemingly trivial observation that since q ■ qinv = 1 (mod R), if we modify the 
final step of the loop body in Algorithm A by replacing the high-half multiply 

cy = VMULB h (tmp,q); 

by a full double-width product which returns the lower and upper halves of tmp ■ q in a pair of b-bit integers: 

hi:lo= UMUL_LOHI b (imp, q); 

then the value of the variable lo on each loop execution simply recapitulates x— cy(+q), with the parenthetical 
(to denote that it is conditional) addition of q occurring if a borrow resulted from the subtraction, that is, 
if Xi < cy. This allows the low half of the above double-width product to be cheaply computed without 
explicit multiplication, which in a multiword-operand setting permits a cheap form of error correction of 
the approximate "probabilistic" carries into the the high-half of the product in a truncated-rhombus half- 
multiply implementation of the UMULH. This optimized variant is indicated in our pseudocode below via 
addition of lo as an optional third argument to the UMULH operation. 

Note that one case where this fails to offer a speed advantage is when q fits into a hardware word and the 
compute architecture has no separate UMULH instruction. For example, an x86_64-specialized version of 
Algorithm A for 64-bit moduli has little choice but to use the above UMUL_LOHI variant, since the x86_64 
MUL instruction automatically returns both halves of the 128-bit product in the RDX:RAX register pair, 
with RDX holding the high half of the product and RAX the low half. On a 64-bit architecture like the older 
(and groundbreaking for its time) DEC Alpha, on the other hand, one would prefer the separate-upper-half 
variant, since the Alpha ISA follows a strict "two input register, one output register" RISC-style instruction 
paradigm and thus has separate hardware-multiply instructions (called MULQ and UMULH) to generate 
the low and high halves of a 128-bit unsigned product. 

Of more immediate usefulness is the observation that the quantity lo = x; — cy(+q) also satisfies a 
scaled-modular-partial-summation property, namely that on the ith loop iteration, 

i 

lo = R l ajR j , for i = 0, n - 1. (11) 

This allows us to cut one loop execution off of Algorithm A, and replace the final loop execution of Algo- 
rithm A with a simple multiply-free postprocessing step involving the high word of x. The resulting variant 
is captured in Algorithm B. 
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ALGORITHM B: IS_DIV_B, right-to-left divisibility check, version 2. Data as for Algorithm A. 



Result: True if q divides x, false otherwise. 
ninth tmp, lo, cy = 0; 
if n —— 1 then 

// Explicit mod needed in single-term case 

return xo%q; 
end 

for i to n — 2 do 

tmp = Xi — cy; 

// Inline multiply-less computation of low half of tmp*q here, 
// f or optional fast half-mul UMULH with carry correction: 
bw = (cy > Xi); 
if bw then 

| lo — tmp + q; 
else 

| lo — tmp; 
end 

tmp = MULLb(tmp, qinv) + bw; 
cy = UMULHb(trap, q[, lo}); 
end 

// Assume loop index is preserved: i = n — 1. Final term needs no explicit multiplies: 
tmp = Xi — cy; 
if cy > Xi then 

I lo = tmp + q; 
else 

| lo = tmp; 
end 

return (lo == 0); 



If instead of merely checking divisibility of x by q, we desire the true remainder r = x (mod q), one again 
simply uses a slightly modified functional interface (which we shall refer to as REMAINDER_B but not give 
explicitly), whereby one returns the 6-bit residue lo directly, then takes the output value lo and scales it as 

r = lo-R 71 - 1 (mod q). 

Note that unlike Algorithm A, if n = 1, that is, the input has just a single term Xo, we must take care to 
do an explicit modular reduction of this term prior to returning, since otherwise xo passes through the loop 
unaltered into the output. This special case is easily handled as shown, however. 

Table 3 lists the per-iteration values of the variable lo (not computed explicitly in the above pseudocode) 
in the middle column for the same example computation as was used for Algorithm A. It can be seen that 
the properly scaled - using - values of this variable given in the rightmost column exactly match the 
scaled data in the analogous column of Table 2. 

5 Efficient Powering to Obtain R n modulo q 

We next turn to computation of the scaling factors R n (mod q) and i? n_1 (mod q) required to compute 
the true remainder from the outputs of the Algorithm A and B loops, respectively, that is to transform the 
outputs out of "Montgomery space" . To aid the discussion here we introduce a small piece of new notation, 
by denning Mp(j, k) with j, k nonnegative integers as the Montgomery product of the jth and fcth powers 
of the radix R, yielding radix power (j + k — 1) (mod q) 

Mp(j, k) := M(W (mod q), R k (mod q), q; R) = i?^" 1 (mod q). 

We will also make use of an abbreviated form of this, 

M P (j,k) ->(j + k-l). 
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Table 3 - Per-loop-iteration data for Algorithm B applied to x = 2 977 - 1 and q = 16357897499336320049, with 
i? = 2 64 . 



i 


lo 


lo ■ R 1 (mod q) 





18446744073709551615 


1CFD12E467CEDBCE 


1 


10394635793536932813 


4D611EA3809531E7 


2 


5051339290092407161 


BD293142B725F2B8 


3 


4156320137058648598 


6DA6C745723D2042 


4 


5752293599955516196 


62A5DDA094A31338 


5 


5424202733172914497 


32D35CC447414DD0 


6 


10596859200044538054 


C4E358F892AFD7CB 


7 


18307282351603437371 


841646B12435044D 


8 


11786040147344227072 


953C002AC7E9CA9B 


9 


5298951544689159434 


87A198DD565BAE6E 


10 


5506369872691534183 


665B982F2C57CF9F 


11 


9101239751444920986 


C2F464196442F72D 


12 


8007683592075709691 


92AB735A1C3B4927 


13 


7266136640719894958 


C2A0BA67701C6754 


14 


12039174030790701247 


339D02265F21100D 


15 


4097145961007838330 


77ABEA1607BF1817 



Let us also define two specialized versions of the basc-i? Montgomery modmul - Firstly a modular squaring 
M(x,x): 



Algorithm: MONT_SQR b 

Data: Unsigned b-bit integers x, q, qinv, q odd and q ■ qinv = 1 (mod 2 b ) 
Result: The radix R = 2 b Montgomery square x 2 ■ R^ 1 (mod q) 

uint b hi:lo = USQRJLOHI b (a;); 
lo = MULLb(gmw, lo); 
lo = XJMULR h (q,lo); 
if hi < lo then 

return hi — lo + q; 
else 

! return hi — lo; 
end 



And a "downshift multiply", which effects a Montgomery multiply-by-unity, M(x, 1): 



Algorithm: MMUL_ONE b 

Data: Unsigned b-bit integers x, q, qinv, q odd and q ■ qinv = 1 (mod 2 b ) 
Result: The radix R = 2 b Montgomery multiply by unity: x ■ R^ 1 (mod q) 

uintb lo — M\JLL b (qinv, x); 
/o = UMULH b ( 9 ,/o); 
if lo ± then 

! return q — lo; 
else 

| return lo; 
end 



For multiword moduli q the USQR_LOHI step in squaring-specialized version will be up to twice as 
fast as a general double-width product computed using UMUL_LOHI, and the multiply-by-unity needs no 
double-width product at all. 
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The nature of the Montgomery modmul is such as to make the chief difficulty here "getting off of 
square one" , as it were. This is because one typically chooses the radix R to be the smallest power of the 
machine wordsize W such that W k > q, making it relatively cheap to compute R (mod q). For example, if 
q < W — 2 64 one could use the intrinsic 64-bit modulo function supported by one's favorite programming 
language, although this may or may not map to an efficient hardware instruction, and in practice is often 
very slow. 

Once one has R (mod q), if one's first instinct is to then perform a modular squaring, one now runs 
headlong into the fact that Mp(l, 1) — >• 1 rather than the desired power, 2. One can of course start with R 
(mod q) and compute R 2 (mod q) via a sequence of b modular doublings. But especially if the required power 
n is not terribly large it is worth trying to optimize the computation of this initial scaling factor R 2 (mod 
q). For example, if the integer-truncated ratio [R/q\ is sufficiently small as to be exactly storable in the 53 
bits of an IEEE64 floating-point mantissa, one can compute the quotient R/q using floating-point arithmetic 
(presumed to be faster than hardware integer division, as is true on most commercial CPU architectures), 
then cast back to integer and use integer arithmetic to compute R (mod q) = R — q ■ \_R/q\. The author's 
own specialized code for the case q < 2 64 uses a small sequence of 64-bit floating point operations (including 
just one explicit floating-point inversion) to efficiently compute 2 96 = R?l 2 (mod q), then feeds that to 
MONT_SQR 64 to obtain 2 128 = R 2 (mod q). This approach can easily be generalized to the multiword-g 
case. 

In any event, once the starting scaling R 2 (mod q) is in hand we would like to use a minimal-cost sequence 
of the above two specialized modmul operations to compute R n+1 (mod q) or R n (mod q), which are the 
radix powers needed to scale the Algorithm A and B loop outputs, respectively, if we also use a Montgomery 
multiply for the scaling step. Our method is reminiscent of the classical algorithm used for left-to-right 
binary exponentiation [5], with a call to MONT_SQRb for each powering bit processed, this mod-squaring 
call being supplemented by a bit-parity-conditional call to MMUL_ONEb . The latter "downshift- multiplies" 
thus play the role of the scalar multiplies which augment the basic sequence of squarings in the left-to-right 
binary powering algorithm whenever a set bit is encountered in the exponent being left-to-right bit-scanned, 
with the added difference that in the present case, the added calls to MMUL_ONEb are triggered by zero 
bits rather than ones. 

A further difference with classical powering ladders is that here we do not know a priori the needed 
sequence of operations, or at least we have not found a method to do so; however this is not problematic 
because we can use an inexpensive preprocessing computation to initialize a bitmap which is then used to 
control the subsequent powering sequence, by encoding the sequence of powers needed to obtain the desired 
power of R (mod q) via calls to the above two specialized versions of the modmul. The preprocessing loop 
works backward from the final power, roughly halving (halving-plus-one, to be precise) the current power at 
each step and using its parity to encode the needed operation in the resulting bitmap. 

In order to simplify the discussion, let us assume we are using Algorithm B, thus that we need R n (mod 
q) as the radix-power input to the output-scaling Montgomery multiply used to produce the true remainder 
x (mod q). Assuming n to be at least 3, starting with power p — 2 we always make one immediate call to 
MONT_SQRb, yielding power 3. For n > 3 we now have the following 2 choices: 

[a] Mp(2,3)^4; 

[b] Mp(3, 3) -> 5, 

which differs from subsequent iterations, for which our two options (based on the input power p at that 
iteration) are as follows: 

[c] Mp(p,p) ->2jj-1; 

[d] Mp(p, Mp(p, 0)) -> 2p - 2. 

The preprocessing step does this by working backwards and encoding the needed steps as a sequence of bits: 

Bit j = : 1 1 means [a] or [b], respectively; 
Bit j > : 1 1 means [c] or [d] . 

Since the [a]/[b]-choice can be made based on the parity of the power p resulting from the halving loop, 
this frees up a bit; accordingly in our implementation we initialize the bitmap bit-index j to and use it 
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strictly for the [c]/[d] -choice selection control. The Algorithm C pseudocode illustrates this powering scheme. 
The 'bitmap' type here is simply shorthand for a suitably long integral type to hold the string of resulting 
bits. Unless the power n is truly gargantuan, meaning x larger than several gigawords, a 32-bit integer is 
perfectly adequate to the task. 



ALGORITHM C: RADIX_POWER, computes R n (mod q) 

Data: Array size n > 2, uint b pow = R 2 (mod q), q, qinv, q odd and q ■ qinv = 1 (mod 2 b ) 
Result: The nth modular power of the twos- complement radix, R n (mod q) 

if n == 3 then 

i pow <— MONT_SQRb(po«;, q, qinv); //pow = R A (mod q) 

else 

uint b ptmp <— MONT_SQRb(pow, q, qinv); // ptmp = R 3 (mod q) 
int j «— 0, p <— n, bra -s— 0; // Init loop parameters and bitmap 
while p > 5 do 

// Each M-mul accumulates another inverse power 1/R; add 1 to 
/ / current power after each halving step here to account for this: 
BIT_SETC(6m, j, IS_EVEN(p)); 
p <- (p/2) + 1; 

3<-j + 1; 

end 

// Next step depends on value of p resulting from above loop: 
if p == 4 then 

| pow <— MONT_MULb(pow, ptmp, q, qinv); 
else if p == 5 then 

[ pow MONT_SQRb (ptfTip, q, qinv); 
end 

// Bitmap-controlled powering loop is only entered if n > 5: 
for i •<— j — 1 to do 

if BIT_TEST(6m, i) then 

ptmp <— MMUL_ONEb(pow, q, qinv); // Reduce R-power by 1 
pow «— MONT_MULb {ptmp, pow, q, qinv); // ...and multiply. 
else 

| pow <- MONT_SQR b (pow, q, qinv); 
end 
end 
end 

return pow; 



The function BIT_SETC {bitmap bm, int j, boolean cond) here sets the jth bit of the bitmap argument 
if the condition {cond) is true; in our case, if the current value of the power p is even. The reason we only set 
any bits for input powers > 5 is because smaller powers are handled by the special-casing described above 
to obtain p — 4 or 5 from the pair of starting powers p = 2 and 3. (Note that we could start things off with 
just the single power p = 3 and then use the option [c]/[d] switch to obtain either p = 4 via Mp{3, Mp{3, 0)) 
or p = 5 via Mp{3, 3) but since we have already used p = 2 to get p = 3 it is more efficient to use the power 
p = 2 directly rather than recomputing it via the downshift-mul Mp{3,0)). 

The final step is to use the resulting power to multiply the scaled remainder resulting from the mod-loop. 
Depending on whether one computes a remainder uaing Algorithm A or B, one feeds either n + 1 or n to 
the radix-powering step: 

/ /Algorithm A flow : / /Algorithm B flow : 

tmp = REMAINDER_A(a;, q, qinv, n); tmp = REMAINDER_B(a;, q, qinv, n); 

pow = RADIX.POWER(q, qinv, n + 1); pow = RADIX.POWER( (? , qinv, n); 

In both cases one then computes the true remainder x (mod q) via a final modmul: 

r = MONT _MULb {tmp, pow, q, qinv) . 
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In fact it is useful to have both options because it affords another modest optimization opportunity, 
namely that of precomputing the powering bitmaps for both n and n + 1 (that is, just the BIT_SETC - 
containing loop above) prior to performing the main bitmap-controlled powering loop. One can then take 
the result with the fewest set bits to generate the scaling needed to apply to the output of Algorithm A 
or B, respectively. Since n is given, this can all be done at the outset and then used to choose which 
scaled-remainder algorithm to use. The general rule is: If the even element of the (n, n + 1) pair has fewer 
set bits than the odd we should use Algorithm A for the remainder computation, otherwise we should use 
Algorithm B. The borderline case of the even element having one less set bit is a wash, because the cost of 
the extra loop execution and concomitant MULL / UMULH pair of Algorithm A offsets the cost savings of 
one fewer call to the MMUL.ONE function. 

Let us again illustrate with a small numerical example: For our example of x = 2 977 — 1 and q = 
16357897499336320049, if using the Algorithm B output the radix-powering step begins with n = 16, and 
the bitmap-creation loop produces successive values p — 16 and 9, at which point it exits with a binary 
bitmap bra = 01 and p = 5. Beginning with the precomputed square R 2 — 5575771501247148520 (mod q), 
the algorithm then does the sequence of four steps summarized in Table 4 with resulting mod-powers of R, 
the first two of which are done by the special-casing in the above code and the last two by the final for-loop 
in Algorithm C, which runs through the bitmap in reverse bit order. 



Table 4 - RADIX_POWER steps to compute R ie (mod q) 



powJn 


bit 


function calls 


operation 


pow.out 


2 




MONTJ3QR 


Mp( 2, 2) 


3 


3 




MONT.SQR 


Mp( 3, 3) 


5 


5 





MONT.SQR 


Mp( 5, 5) 


9 


9 


1 


MONT.SQR, MMUL.ONE 


Mp( 9, Mp(9,0) ) 


16 



This yields R 16 = 1547775041475743422 (mod q), and a final Montgomery-multiply of this with the 
REMAINDER_B output trap = 4097145961007838330 yields the true mod, r = 8623243291871090711 = 
2 977 — 1 (mod q). If we instead use the Algorithm A route we execute the radix-powering step beginning 
with 'n' (really n + 1) = 17, the bitmap-creation loop produces successive values p = 17 and 9 and exits with 
bitmap bra = 00 and p = 5. For this exponent the resulting powering sequence requires no Montgomery- 
mul- by- unity, as shown in Table 5. 



Table 5 - RADIX_POWER steps to compute R 17 (mod q) 



powJn 


bit 


function calls 


operation 


pow_out 


2 




MONT_SQR 


Mp( 2, 2) 


3 


3 




MONT.SQR 


Mp( 3, 3) 


5 


5 





MONT.SQR 


Mp( 5, 5) 


9 


8 





MONT.SQR 


Mp( 9, 9) 


17 



The result is R 17 = 8502984233828494641 (mod q), and a final Montgomery- multiply of this with the 
REMAINDER_A output trap = 11326687669760783497 again yields the desired true mod. 

In order to collect some statistics about the efficacy of counting the bitmap set-bits for such pairs and 
then choosing Algorithm A or B depending on the result, we have calculated the number of set bits in the 
bitmap for all (n, n + 1) pairs from n — 6 (the smallest value for which Algorithm C creates a bitmap) up 
to n — 2 20 . The average difference in the number of bits is just over 1 (we obtain 1.0000123948), but the 
maximal difference is asymptotically the same as the number of bits in the bitmap. This can be shown by 
examining the worst-case scenario, which is when n = 2 a + 1, for which n yields a bitmap of (a — 1) binary 
zeros and n + 1 yields one of (a — 1) binary ones. Again, this is only germane if n is not huge, due to 0(lg n) 
modmul count of Algorithm C, which is asymptotically vanishing relative to the cost of the remaindering 
loops of Algorithms A and B as n — > oo. 
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6 Use of the Remainder to Obtain the Quotient 



The quotient-producing algorithm again starts with the basic property of the Montgomery inverse, 

q ■ qinv = 1 (mod R). 

The general theme here is expressed by the question "can we multiply the input x by the mod-inverse qinv 
to somehow obtain the quotient?" Starting with the original dividend x this leads nowhere, but once we 
have computed the remainder r = (x mod q), we note that the difference (x — r) is divisible by q, that is, 
(x — r) = k ■ q, with k a nonnegative integer. Thus 

(x — r) ■ qinv = k (mod R). 

In other words MULL b (a; — r,qinv) yields the bottom 6 bits of the quotient [x/q\, exactly. If k < R (that 
is, x < (q + 1)R) this is all we need. For arbitrarily large quotients, that is, k possibly much larger than 
the modulus q, one approach is to "extend the mod-inverse" by doing more Newton iterations to produce a 
"full-sized" mod- inverse qinv n , such that 

q ■ qinv n = 1 (mod R n ), 

for which a length- R n MULL(a; — r, qinv) yields the full quotient, exactly. But this is very wasteful in that 
the work scales supralinearly in n. At this point we take a clue from the remainder computation and realize 
that we can combine the above idea with the right-to-left Montgomery-mul-based carry scheme to compute 
the quotient in linear-work fashion, one radix- i?-sized piece at a time. The resulting method is captured in 
Algorithm D. 

Note that the subtraction of the scalar (in the sense that it fits into a single 6-bit integer) remainder r from 
the vector x may result in a borrow which propagates as far as the uppermost word of the result, but this 
possibility can be efficiently dealt with in parallel with the quotient computation by initializing the value of 
the loop-carry variable cy to equal the remainder. Since subsequent values of cy are the result of a UMULH, 
they cannot exceed the largest possible UMULH output, UMULH(i?- 1, R- 1) = [(R 2 -2R+1) / R\ = R-2, 
thus permitting the subtraction of both cy and the binary (strictly or 1) borrow resulting from the preceding 
loop execution in the first step of the loop body. (In other words the sum cy + bw cannot overflow). The 
modmul portion of the loop body is more or less identical to that of Algorithms A and B, with the crucial 
difference that the output of each loop execution consists of two 6-bit data: the current word of the quotient, 
which is in the MULL output, and the leftward carry into the next-higher word, which is the UMULH 
output. No radix-power scaling of the output is needed. 

ALGORITHM D: The quotient computation 

Data: As for Algorithm A, plus the remainder r and an integer array y[n] to take the result. 
Result: The quotient [x/q\ is returned in y, which may coincide with x. 

uintb tmp, bw = 0, cy = r; 
for i <- to n - 1 do 

tmp = Xi — bw — cy; 

// Unsigned compare of result to check for a borrow: 

bw — (tmp > Xi); 

tmp — MULLt, (imp, qinv) + bw; 

cy = UMULH b (tmp,g); 

// Write the current word of the quotient, which is in the MULL output: 
y % = tmp; 
end 

There is also a key difference between Algorithms A/B and Algorithm D with regard to the subtraction 
which begins each loop execution. Here, since the quotient computation is literal rather than (mod q), if 
the subtraction results in a borrow, that is accounted for by carrying the resulting —1 into the next-higher 
word. Similarly to the remainder computation, the above quotient algorithm allows for a fast computation 
of the (discarded) low half of the final tmp ■ q product, which could be used in a streamlined error-corrected 
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Table 6 - Per- word data for Algorithm D applied to x = 2 977 - 1 and q = 16357897499336320049, with R = 2 64 . 
Right column: An x86_64 assembly-code implementation of the 2-way-folded quotient loop. Execution order 
matches natural reading order. 



i 


Quotient word yi 







6364180061714936936 


// RCX=n/2, RSI= 


=q, RBX=qinv, RDI , RDX=cy0 , 1 , R8 , R9=bw0 , 1=0 


1 


4771973621301622518 


LEA 


(RIO, RCX, 8) ,R11 


// Rll = &x+n/2 (&x in RIO) 


2 


694724920058399436 


LEA 


(R13,RCX,8) ,R14 


// R14 = &y+n/2 (&y in R13) 


3 


7462732776264284083 


loop 


: // Loop from n/2 to 1 


4 


15651191667900344027 


SUB 


R8 ,RDI 


SUB 


R9 ,RDX // (bw+cy) via cy-(-bw) 


5 


684779273839653350 


MOV 


(RIO) ,RAX 


MOV 


(R11),R12 // x[i], x[i+n/2] 


6 


8910056920539811989 


SUB 


RDI.RAX 


SBB 


R8 ,R8 // tmpO=xO-(bwO+cyO) , R8=-CF 


7 


6625598233439971816 


SUB 


RDI,RDI 


SBB 


R9 ,R9 // tmpl=xl-(bwl+cyl) , R9=-CF 


8 


13578887251066731535 


ADD 


8, R10 


ADD 


8, Rll // Increment x-array pointers 


9 


7249027741998019233 


IMUL 


RBX , RAX 


IMUL 


RBX.R12 // tmpO.l *= qinv 


10 


11772736962114281085 


MOV 


RAX, (R13) 


MOV 


R12,(R14) // y[i] ,y[i+n/2] <— tmpO.l 


11 


15530135107470554958 


ADD 


8,R13 


ADD 


8.R14 // Increment y-array pointers 


12 


6468054066637286049 


MUL 


RSI 


MOV 


RDX.RDI // cyO = MULH64(q,tmpO) 


13 


8083046564352798341 


MOV 


R12.RAX 


MUL 


RSI // cyl = MULH64(q,tmpl) 


14 


147809 


DEC RCX 






15 





JNZ loop // loop if 


RCX != 



version of a multiword UMULH: Here the low half of a full double- width UMUL_LOHIb (trap, q) must simply 
be equal to the result of the opening tmp = x.i — bw — cy operation. 

For our example x = 2 977 - 1 and q = 16357897499336320049, Table 6 lists the successive y t outputs for 
the above loop, which can easily be verified as the coefficients of the quotient \x/q\ with respect to radix-2 64 
unsigned arithmetic. 

At this point we feel compelled to make a few more comments regarding the building-up of the quotient 
in the above method. We illustrate things using a slightly smaller example, x fitting into three 64-bit words 
and the same 128-bit modulus q we used in Section 1 to illustrate the streamlined mod-inverse computation: 

x = 153238840814299457340643142885404331762436489574620087 
q = 225797717267637708506527464987314161, R= 2 128 
qinv = 98317950452290864966529955359911823633 

= 5329826773734796952 • 2 64 + 18061898331188349201 . 

The Algorithm A loop output = 204005585172405248723488909070772720, which is subsequently negated 
and multiplied by R 2 (mod q). The Algorithm B loop output = 85991941466095605134529253504604205, 
which is multiplied by R (mod q). Both methods obviously give the same remainder, 

r = 130392762589805994888402779408669015 . 

We next form (x — r) in preparation for the quotient computation: 

x-r= 153238840814299457210250380295598336874033710165951072 
= 450328479259411 • 2 128 + 18248253616373032484 • 2 64 + 17701223841397244512 
= 450328479259411 • 2 128 + 336620864253378130591640020431239938656 . 

MULL of the bottom 128 bits of (x — r) with qinv gives the exact quotient y = \x/q\ = (x — r)/q: 

MULLi2 8 (336620864253378130591640020431239938656, qinv) = 678655403024582752 . 

At this point we can skip the remaining Algorithm D loop execution if we like, since it only serves to nullify 
the remaining bits of (x — r). This, and the fact that the loop runs through the words of (x — r) in strictly 
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ascending order, highlights a seemingly curious property of the quotient computation: we use only the jth 
and lower words of (x — r) to compute the jth word of the quotient y. Thus, if we compute in advance the 
expected number of nonzero words in y, we can compute y using only that number of low words of (x — r). 
In the above example, since we can see easily that the quotient will fit in a 64-bit word, we can exit the loop 
early as noted, but we can in fact use just the low 64 bits of qinv and compute the quotient using just a 
single MULL of that with the the low 64 bits of (x — r) : 

MULL 64 (17701223841397244512, 18061898331188349201) = 678655403024582752 . 

The idea that one can compute the exact quotient without explicit reference to the high words of [x — r) may 
strike readers whose intuition in such matters has been shaped by classical long division (that is, most of us) 
as bizarre, but note that the datum (x — r) is the result of parsing the entire original input x being divided 
in order to compute the remainder; the Montgomery modmul-bascd quotient algorithm is simply exploiting 
the fact that the remainder-subtraction has in effect "shifted" all the information needed to compute the 
quotient y into the bottom [lg y~\ bits of the quantity (x — r) . (We admit that seeing this effect in action can 
be somewhat startling at first.) 

We have not found it possible to modify the above sequence to permit for obtaining of the remainder 
and the quotient without the above second pass through the (remainder-subtracted) data, that is, "in one 
fell swoop" as is done in classical long divisions implemented in left-to-right fashion. However, the quotient 
algorithm permits the same kind of folded-loop interleaved-instruction-stream optimization as we described 
for the remaindering algorithms (using Algorithm A for the detailed demonstration) , when x is large relative 
to the divisor q. There is a small twist when applying this approach to the quotient, however, which we 
illustrate using the smallest nontrivial folding factor, F = 2. The quotient Algorithm D applied to the 
resulting 2 half-length vectors now needs partial remainders Tq and r\ which will be subtracted from the 
lower and upper dividend halves prior to entering the algorithm's loop. The upper-half partial remainder 
is just that, ri = [x/R n ^ 2 \ (mod q). But for the lower-half portion of the computation, we consider again 
a concrete case such as captured in Table 6, and note that the lower-half quotient data must be the same 
as those produced by the serial version of the algorithm, e.g. quotient words yo — j/7 in the table. Thus the 
correct lower-half partial remainder is the full- dividend remainder. The generalization of this to larger folding 
factors involves a "downward cascade" of partial remainders, in which each partial remainder is that for the 
entire portion of dividend (x) array above it. For instance, if we are using a loop-folded implementation 
of Algorithm A for the remaindering, we can compute the needed terms via a slight rearrangement of the 
partial-remainders in Rather than using the nesting (|10[) , which is inappropriate if the quotient is being 
computed because the intermediate partial remainders are negated (mod q), we now explicitly negate each 
term to produce the required partial remainders in descending, chained fashion: 

Tf-i = -cyF-i ■ P, r F - 2 = (tf-1 - cyF-2) ■ P, r Q = (r x - cy ) ■ P, (12) 

where all terms are computed (mod q) . The observant reader will be asking how to account for the "missing 
borrows" here, e.g. for F = 2 if the processing of the i = ri/2 — 1 term in Algorithm D leaves us with 
bw = 1 , which in the serial version of the algorithm is subtracted from the low word of the upper half of the 
x- vector, x n /2- In loop- folded mode this portion of the vector starts with its borrow set to 0, but note that 
this automatically compensated for in the above construction of the partial remainders, since the resulting 
carry subtracted from the x n /2 term will be one larger as a consequence of the same truncation of the borrow 
chain in the loop-split remaindering computation. 

For the full-DIV (2-pass, one for remainder and the second for quotient) algorithm, again using GCC 
on the Core 2 architecture, we see an appreciable speed difference between our C code and assembler 
implementations of the 2-way and 4-way-split algorithm, likely because the need to compute distinct array 
read and write addresses - we used an out-of-place implementation for generality, rather than making use 
of the expedient of overwriting the dividend by the quotient - leads to severe register pressure even in the 
2-way-split case, as our x86_64 assembly code for the loop body in the rightmost column of Table 6 illustrates. 
This is a direct pseudocode rendering of our inline assembler, and 13 of the available 14 general-purpose 
registers (RSP and RBP are reserved for use by the operating system and compiler) are used. For the 
2-way-split case, our C code needs 21 cycles per input word for the full DIV; our assembler version needs 
just 17 cycles. In the 4-way-split case the C code needed 15.5 cycles per word, compared to a mere 12.5 for 
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the assembly. Thus, for a register-lean custom assembler implementation of the 2 and 4-way-split quotient 
loops, we obtain virtually identical timings for the quotient computation as for the remaindering. 

One other note about the assembly coding, as illustrated by the example snippet, relates to the issue 
of detailed scheduling of the high-latency IMUL and MUL instructions. The MUL is especially interesting 
because it has both high latency and the x86-designer-imposed constraints of the implicit input in RAX and 
the two output words being written to the RAX and RDX register^]. Thus if one is doing more than one 
MUL in a given block of assembly, one must copy the outputs which are needed (in our case RDX) to a 
different register or to memory, then load one datum of the following MUL input pair into RAX before doing 
the next MUL. One would think that this would make for an instruction scheduling nightmare, but that is 
not the case. Notice in our assembly that we appear to blithely disregard the timing impact of the foregoing 
register-copy issues and simply schedule the MULs and succeeding output-copying MOV instructions in 
simple sequential order (lines i — 12 and i = 13 of the table). This is because the x86 has many more 
physical registers than the mere 16 named ones visible to the programmer. This large "hidden" register pool 
supports a very powerful runtime-optimization capability, via the register-rename and out-of-order execution 
engines built into the processor. These runtime optimizations will typically assign separate physical registers 
to store such (apparently) troublesome intermediates, although it is impossible to be certain without a cycle- 
accurate simulator of these chip internals. Nonetheless, our timings provide indirect evidence of this process 
at work, since the resulting timings would not be possible if the dependencies were real. 

Thus, the dependence of the instruction sequence as it relates to getting MUL inputs into the RAX 
register and outputs out of RAX and RDX is real as far as register availability for the coding is concerned, 
but is false with respect to the resulting data movement actually occurring or having runtime impact, and 
as an assembly programmer (or compiler-optimizer writer) one must be aware which scheduling constraints 
are real, and which are not. This neatly captures a key theme related to code optimization for such heavily 
virtualized execution engines: We have found that the biggest performance gains are achieved by giving the 
processor a reasonably full menu of independently executable instructions at all times and then "just getting 
out of its way" , so to speak, rather than trying to micromanage instruction streams which will inevitably be 
rescheduled at runtime in a fashion which is invisible to, and hence beyond the control of, the programmer. 

By way of comparison with a state-of-the-art implementation of left-to-right DIV, our current-best 12.5 
cycles per dividend word compares to 25 cycles for the current GMP release on the same architecture [9] 
(cf. § V, the case study of the DIV_NBY1 implementation). It is possible that classical left-to-right DIV 
algorithms (which the GMP n/1 division implementation represents, albeit in a very cleverly structured 
and highly optimized form) may benefit from the same kind of loop-folding optimization we describe here. 
However, were such a parallelized approach both possible and advantageous for the left-to-right DIV, one 
would expect it to have been employed some time ago, since latency issues related to integer MUL are not 
a new phenomenon. In fact, note that the 2-pass separate remainder/quotient structure of our approach 
here is key to this loop-folding parallclization, since the initial remaindering pass provides the partial- vector 
remainders needed to perform the quotient step in analogous parallel fashion. Time will tell whether classical 
left-to-right DIV can also benefit from a similar kind of optimization. 

Lastly, what to do if q is even? That is quite simple: One starts as described in the analogous Algorithm A 
commentary, by counting the trailing binary zeros of q and saving that value in an integer variable tz q . Then 
one does as follows: 

ALGORITHM D': Modifications to Division procedure needed for even divisor q 

1. Save the bottom tz q bits of x: bsave — x Sz{2 tZq — 1); 

2. Right-shift q and x to obtain q — (j> tz q ) and x 1 = (x ^> tz q ); 

3. The combination of Algorithm A/B and C applied to x' and q' yields a partial remainder r' , 
from which the true remainder is obtained via r = (r' -C tz q ) + bsave; 

4. Algorithm D applied to x' , q' and r' yields the quotient, without any other adjustment required. 



1 This user constraint is alleviated in the next-generation x86 architecture, "Haswell", by way of a new MULX instruction 
which uses a 3-operand RISC-style format. 
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7 Efficient Inverse Power of 2 Modulo q 



We conclude by considering the special case x = 2 P + a. Computing x (mod q) here calls for a binary 
powering ladder, e.g. a left-to-right binary exponentiation starting with seed s = 2 corresponding to the 
leftmost set bit of the exponent p, and parsing the lower-order bits of p in rightward fashion. For each bit 
along the way one squares the current residue s (mod q) and if the current bit = 1 one further performs a 
mod-doubling of the result. At the end one can check divisibility via s = —a (mod q), or compute the full 
remainder r = s + a (mod q). Thus, only |_lg(p)J modmuls are required. 

If using a radix- R Montgomery modmul for this purpose one typically scales the initial seed as s' — s ■ R 
(mod q). Each mod-squaring then is replaced by a scaled variant 

M(s', s') = M(sR, sR) = s 2 R 2 R- 1 = s 2 R (mod q), 

thus the scaling factor R is carried through the sequence of squarings in constant fashion, and we note the 
bit-dependent mod-add does not alter this property, due to the distributivity of modular addition. Thus 
one can perform the powering as usual and at the end transform the output via modmul with R^ 1 (mod q), 
which can be effected with a single call to MMUL_ONEb- 

However, because the dividend involves a power of 2, as does the radix R, we can make the inverse power 
of R introduced by each modmul work to our advantage, and thus entirely avoid the need for the explicit 
i?-scalings. Using that R — 2 b we define a modified power 

p' =p + b, 

which is used to control the mod-doublings in the powering sequence as described above, with one twist: 
Since we will be computing a sequence of negative powers culminating in 2~ p (mod q), we do a mod- 
doubling whenever the current bit of p' is 0, or as in the pseudocode below, whenever the bit of of the logical 
complement pshift := NOT(p') is set. As is the case in normal (positive-exponent) binary powering, we can 
save some work by setting our initial seed not to 2 but rather to 2 X , where x is the largest chunk of bits of p' 
starting with the leftmost set bit which is less than b, the power of 2 in our arithmetic radix. For simplicity 
we here assume that p and q are both < R, although it is easy to accommodate arbitrarily large exponents. 
Within a uintf, datum, bits are indexed as [b-l:0], from left to right: 



ALGORITHM E: Computation of 2 p (mod q) without any explicit radix-scalings. 

Data: Unsigned 6-bit exponent p, q odd, q ■ qinv = 1 (mod 2 ). 
Result: 2~ p (mod q). 

uintb pshift = p + b; // Scale p by power of 2 in the arithmetic radix 

int leadz —[number of leading zero bits in pshift}; 

int ii = b — 1 — leadz; // Index of leftmost set bit in pshift. 

int io = [Starting at i\, smallest bit index such that the bitfield [ii : io] < b[; 

int ichunk = (pshift ^> io); // This is the bitfield [ii : io] 

uintb s — 1 <C (b — ichunk — 1); // The resulting initial seed 

pshift — ~NOT(pshift); // Logical complement of the scaled exponent 

// Loop downward though the bits of the scaled, complemented exponent 
for int i •<— io — 1 to do 

s = MONT_SQRb(s, q, qinv); 
if BIT_TEST(ps/w/i, i) then 
s — s + s; 
if (s > q) s = s — q; 
end 
end 

// Always need a final mod-doubling prior to returning. 

s — s + s; 

if(s > q) s = s — q; 

return s; 



We illustrate once again using our earlier example, the problem of finding the remainder of x — 2 — 1, 
modulo q = 16357897499336320049. The 64-bit mod -inverse qinv = 9366409592816252113. Working 
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through the steps of Algorithm E, we have pshi ft = 977 + 64 = 1041, which in binary form = 10000010001 2 . 
Using R = 2 64 , we have leadz = 53 and i\ = 64 — 1 — 53 = 10. The leftmost 6-bit-wide chunk of 
p is guaranteed to be in [32,63], hence io = i\ — 6 + 1 = 5, and the 6-bit-wide bitfield in question is 
ichunk = (p > 5) = 100000 2 = 32. The resulting initial seed s = (1 <C (64 - 32 - 1)) = (1 « 31) or 2 31 in 
binary-exponential notation. The for-loop in Algorithm E now runs from bit io — 1 = 4 down to bit of the 
remaining portion of NOT(p') (labeled pshift in the pseudocode), yielding the Montgomery-mul, mod-add 
and binary-power sequence given in Table 7. 



Table 7 - Sequence of mod-powers computed in inner loop of Algorithm E, for p — 977, with R — 2 1 



bit 


ith bit 








index i 


of NOT(p') 


function calls 


operation 


pow_out 


4 





MONT_SQR(2 3i ) 


2 3ix2-64 




3 


1 


2 x MONT_SQR(2- 1 ) 


2-2x2-64 x 2 


2~ 68 x2 = 2~ 67 


2 


1 


2 x MONT_SQR(2- 66 ) 


2-67x2-64 x 2 


2~ 198 x 2 = 2~ 197 


1 


1 


2 x MONT_SQR(2- 196 ) 


2-197x2-64 x 2 


2- 458 x 2 = 2- 457 








MONT_SQR(2- 457 ) 


2-457x2-64 


2-978 



The unconditional mod-doubling of the powering-loop output yields s = 2 977 (mod q) — 7143819210136784550. 
Taking the positive-power remainder r = 2 977 (mod q) = 8623243291871090712 - which is just our previously 
computed remainder (2 977 — 1) mod q, plus 1 - it is easily verified that s ■ r = 1 (mod q). This needs just 5 
calls to MONTJ3QR; Note that this example also illustrates the worst-case scenario for the inverse-powering 
scheme, which is when the scaled power p + b has one more bit than p alone, leading to an extra loop 
execution relative to a more-conventional positive-powering ladder. 

In order to make a more precise work comparison, let us compare this to an orthodox 2 P (mod q) 
computation, also using the Montgomery modmul. Here we start with p = 977, which in binary form 
= 1111010001. Using R = 2 64 We again process the leftmost 6 bits of this en bloc; these = 111101 = 61. 
The required powering-loop-entry seed value is thus s = R * 2 61 (mod q) — 696971437655893565. We can 
compute this in similar fashion as we did our seed value for the Algorithm C scaling-factor computation, 
by first computing i? 3 / 2 (mod q) via 64-bit floating-point arithmetic and calling MONT_SQR to obtain R 2 
(mod q), then calling MONT_MUL(i? 2 (mod q), 2 61 ) to obtain s. This is followed by 4 loop iterations, 
summarized in Table 8, and a final MMUL.ONE of the loop output yields 2 1041 ~ 64 = 2 977 (mod <7).Thus, 
while we have one fewer loop iteration in this case (and this is the exception rather than the rule) which 
saves us a mod-squaring, we incur three added modmuls (one squaring, one general multiply, one "downshift 
multiply" by unity) in the pre- and postprocessing scalings. Thus in this example - which admittedly has 
a relatively small exponent p - the inverse-power scheme captured in Algorithm E needs 5 modmuls (all 
squarings), compared to 7 total modmuls (5 of which are squarings) for the positive-power computation. 
The asymptotic cost for both schemes for large p is of course the same: 0(\gp) mod-squarings, and a similar 
number of mod-doublings, the latter of which are of negligible cost in the large-modulus-asymptotic sense. 

Table 8 - Sequence of mod-powers computed in inner loop of standard Montgomery powering ladder to yield 
2 P (mod q), for p = 977, with R = 2 64 . 



bit 


ith bit 








index i 


of p 


function calls 


operation 


pow.out 


3 





MONT_SQR(2 125 ) 


2 l25x2-64 


2 186 


2 





MONT_SQR(2 186 ) 


2186x2-64 


2308 


1 





MONT_SQR(2 308 ) 


2308x2-64 


2552 





1 


2 x MONT_SQR(2 552 ) 


2552x2-64 x 2 


2 1040 x 2 = 2 1041 



If the additive constant a ^ ±1 we must compute the inverse s _1 (mod q) via 6-bit extended gcd in 
order to find the true remainder 2 P + a (mod q) , which negates any savings resulting from the lack of explicit 
i?-related scalings, but if a = ±1 and we only care about divisibility, we are done. As such, the method is 
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ideally suited for finding or verifying factors of Format and Mersenne numbers. 

The above inverse-powering algorithm, together with multiply arithmetic using R = 2 96 , was used by 
the author to discover the fourth known (and currently largest-known) prime factor of the double-Mersenne 
number MM31, the 78-bit factor q = 178021379228511215367151 = 2 • 41448832329225 • M 31 + 1. 0[7] 
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